(54) Title: HIGHLY AVAILABLE FILE SERVERS
(57) Abstract

The invention provides a storage system that is highly available
even in the face of component failures in the storage system, and a
method for operating that storage system. A first and a second file
server each includes a file server request log for storing incoming file
server requests. Both the first and second file servers have access to a
common set of mass storage elements. Each incoming file server request
is copied to both the first and second file servers; the first file server
processes the file server request while the second file server maintains a
copy in its file server request log. Each file server operates using a file
system that maintains consistent state after each file server request. On
failover the second file server can perform those file server requests in
its file server request log since the most recent consistent state. There
is no single point of failure that prevents access to any individual mass
storage element.




PCT/US99/05071 

WO 99/46680 



Title of the Invention

Highly Available File Servers

Background of the Invention

1. Field of the Invention

The invention relates to storage systems.

2. Related Art

Computer storage systems are used to record and retrieve data. In some computer
systems, storage systems communicate with a set of client devices, and provide services for
recording and retrieving data to those client devices. Because data storage is important to many
applications, it is desirable for the services and data provided by the storage system to be
available for service to the greatest degree possible. It is therefore desirable to provide storage
systems that can remain available for service even in the face of component failures in the
storage system.

One known technique for providing storage systems that can remain available for
service is to provide a plurality of redundant storage elements, with the property that when a first
storage element fails, a second storage element is available to provide the services and the data
otherwise provided by the first. Transfer of the function of providing service from the first to
the second storage element is called "failover." The second storage element maintains a copy of
the data maintained by the first, so that failover can proceed without substantial interruption.

A first known technique for achieving failover is to cause the second storage
element to copy all the operations of the first. Thus, each storage operation completed by the
first storage element is also completed by the second. This first known technique is subject to
drawbacks: (1) It uses a substantial amount of processing power at the second storage element
duplicating efforts of the first, most of which is wasted. (2) It slows the first storage element in
confirming completion of operations, because the first storage element waits for the second to
also complete the same operations.
A second known technique for achieving failover is to identify a sequence of
checkpoints at which the first storage element is at a consistent and known state. On failover,
the second storage element can continue operation from the most recent checkpoint. For
example, the NFS (Network File System) protocol requires all write operations to be stored to disk
before they are confirmed, so that confirmation of a write operation indicates a stable file system
configuration. This second known technique is subject to drawbacks: (1) It slows the first
storage element in performing write operations, because the first storage element waits for write
operations to be completely stored to disk. (2) It slows recovery on failover, because the second
storage element addresses any inconsistencies left by failure of the first between identified
checkpoints.

Accordingly, it would be advantageous to provide a storage system, and a method
for operating a storage system, that efficiently uses all storage system elements, quickly
completes and confirms operations, and quickly recovers from failure of any storage element. This
advantage is achieved in an embodiment of the invention in which the storage system
implements frequent and rapid checkpoints, and in which the storage system rapidly distributes
duplicate commands for those operations between checkpoints among its storage elements.

Summary of the Invention

The invention provides a storage system that is highly available even in the face
of component failures in the storage system, and a method for operating that storage system. A
first and a second file server each includes a file server request log for storing incoming file
server requests. Both the first and second file servers have access to a common set of mass
storage elements. Each incoming file server request is copied to both the first and second file
servers; the first file server processes the file server request while the second file server maintains a
copy in its file server request log. Each file server operates using a file system that maintains
consistent state after each file server request. On failover the second file server can perform
those file server requests in its file server request log since the most recent consistent state.

In a second aspect of the invention, a file server system provides mirroring of one
or more mass storage elements. Each incoming file server request is copied to both the first file
server and the second file server. The first file server performs the file server requests to modify
a primary set of mass storage elements, and also performs the same file server requests to
modify a mirror set of mass storage elements. The mirror mass storage elements are disposed
physically separately from the primary mass storage elements, such as at another site, and provide a
resource in the event the entire primary set of mass storage elements is to be recovered.

Brief Description of the Drawings

Figure 1 shows a block diagram of a highly available file server system. 

Figure 2 shows a block diagram of a file server in the file server system. 

Figure 3 shows a process flow diagram of operation of the file server system. 

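Detailed Description of the Preferred Embodiment

In the following description, a preferred embodiment of the invention is described
with regard to preferred process steps and data structures. However, those skilled in the art
would recognize that implementing the invention using suitable equipment would not require
undue experimentation or further invention.

File Server Pair and Failover Operation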

Figure 1 shows a block diagram of a highly available file server system.

A file server system 100 includes a pair of file servers 110, both coupled to a
common set of mass storage devices 120. A first one of the file servers 110 is coupled to a
first I/O bus 130 for controlling a first selected subset of the mass storage devices 120. A
second one of the file servers 110 is coupled to a second I/O bus 130 for controlling a second
selected subset of the mass storage devices 120.

Although both file servers 110 are coupled to all of the common mass storage
devices 120, only one file server 110 operates to control any one mass storage device 120 at any
designated time. Thus, even though the mass storage devices 120 are each controllable by only
one file server 110 at a time, each of the mass storage devices 120 remains available even if one
of its two associated file servers 110 fails.

In a preferred embodiment, the file server system 100 includes a pair of such file
servers 110; however, in alternative embodiments, more than two such file servers 110 may be
included in a single file server system 100.

In a preferred embodiment, the first I/O bus 130 and the second I/O bus 130 each
include a mezzanine bus such as the PCI bus architecture.

In a preferred embodiment, the mass storage devices 120 include magnetic disk
drives, optical disk drives, or magneto-optical disk drives. In alternative embodiments, however,
other storage systems may be used, such as bubble memory, flash memory, or systems using
other storage technologies. Components of the mass storage devices 120 are referred to as
"disks," even though those components may comprise other forms or shapes.

Each mass storage device 120 can include a single disk or a plurality of disks. In
a preferred embodiment, each mass storage device 120 includes a plurality of disks and is
disposed and operated as a RAID storage system.

In a preferred embodiment, the first file server 110 is coupled to the second file
server 110 using a common interconnect. The common interconnect provides a remote memory
access capability for each file server 110, so that data can be stored at each file server 110 from a
remote location. In a preferred embodiment, the common interconnect includes a Tandem
"ServerNet" interconnect. The common interconnect is coupled to each file server 110 using a
device controller coupled to an I/O bus for each file server 110.

The first file server 110 is coupled to a first network interface 140, which is
disposed to receive file server requests 151 from a network 150. Similarly, the second file server
110 is coupled to a second network interface 140, which is also disposed to receive file server
requests 151 from the network 150.

The first file server 110 includes a first server request memory 160, which
receives the file server requests 151 and records them. In the event the first file server 110
recovers from a power failure or other service disruption, the outstanding file server requests 151 in
the first server request memory 160 are re-performed to incorporate them into a next consistent
state of the file system maintained by the first file server 110.

Similarly, the second file server 110 includes a second server request memory
160, which receives the file server requests 151 and records them. In the event the second file
server 110 recovers from a power failure or other service disruption, the outstanding file server
requests 151 in the second server request memory 160 are re-performed to incorporate them into
a next consistent state of the file system maintained by the second file server 110.

When the first file server 110 receives a file server request 151 from the network
150, that file server request 151 is copied into the first server request memory 160. The file
server request 151 is also copied into the second server request memory 160 using remote
memory access over the common interconnect. Similarly, when the second file server 110 receives a
file server request 151 from the network 150, that file server request 151 is copied into the
second server request memory 160. The file server request 151 is also copied into the first server
request memory 160 using remote memory access over the common interconnect. Using remote
memory access is relatively quicker and has less communication overhead than using a
networking protocol.
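
The copying of each request into both server request memories can be sketched as follows. This is a minimal illustration only, under stated assumptions: the class `FileServer`, its fields `own_log` and `partner_area`, and the method `receive` are hypothetical names, and an ordinary list append stands in for the remote memory write over the common interconnect.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FileServer:
    """Hypothetical model of one file server 110 and its server request memory 160."""
    name: str
    own_log: list = field(default_factory=list)       # requests received by this server
    partner_area: list = field(default_factory=list)  # copies written here by the partner
    partner: FileServer | None = None

    def receive(self, request: str) -> None:
        # Record the incoming request in the local server request memory.
        self.own_log.append(request)
        # Assumption: a plain append stands in for remote memory access.
        # The partner does no processing; its memory is simply written.
        if self.partner is not None:
            self.partner.partner_area.append(request)


first, second = FileServer("first"), FileServer("second")
first.partner, second.partner = second, first

first.receive("write /vol0/a")
second.receive("mkdir /vol1/d")
```

After these two calls, each server holds its own request locally and a copy of its partner's request, which is the state a surviving server would rely on at failover.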

In the event that either file server 110 fails, the other file server 110 can continue
processing using the file server requests 151 stored in its own server request memory 160.

In a preferred embodiment, each server request memory 160 includes a nonvolatile
memory, so that the file server requests 151 stored there are not lost
due to power failures or other service interruptions.

The responding file server 110 processes the file server request 151 and possibly
modifies stored files on one of the mass storage devices 120. The non-responding file server
110, partner to the responding file server 110, maintains the file server request 151 stored in its
server request memory 160 to prepare for the possibility that the responding file server 110
might fail. In the event the responding file server 110 fails, the non-responding file server 110
processes the file server request 151 as part of a failover technique.

In a preferred embodiment, each file server 110 controls its associated mass
storage devices 120 so as to form a redundant array, such as a RAID storage system, using
inventions described in the following patent applications:

o Application Serial No. 08/471,218, filed June 5, 1995, in the name of inventors David
Hitz et al., titled "A Method for Providing Parity in a Raid Sub-System Using Non-
Volatile Memory", attorney docket number NET-004;

o Application Serial No. 08/454,921, filed May 31, 1995, in the name of inventors David
Hitz et al., titled "Write Anywhere File-System Layout", attorney docket number NET-
005;

o Application Serial No. 08/464,591, filed May 31, 1995, in the name of inventors David
Hitz et al., titled "Method for Allocating Files in a File System Integrated with a Raid
Disk Sub-System", attorney docket number NET-006.

Each of these applications is hereby incorporated by reference as if fully set forth
herein. They are collectively referred to as the "WAFL Disclosures."

As part of the techniques shown in the WAFL Disclosures, each file server 110
controls its associated mass storage devices 120 in response to file server requests 151 in an
atomic manner. The final action for any file server request 151 is to incorporate the most recent
consistency point into the file system 121. Thus, the file system 121 is in an internally consistent
state after completion of each file server request 151. Thus, a file system 121 defined over the
mass storage devices 120 will be found in an internally consistent state, regardless of which file
server 110 controls those mass storage devices 120. Exceptions to the internally consistent state
will only include a few of the most recent file server requests 151, which will still be stored in the
server request memory 160 for both file servers 110. Those most recent file server requests 151
can be performed starting from the most recent consistent state.

For any file server request 151, in the event the file server 110 normally
responding to that file server request 151 fails, the other file server 110 will recognize the failure
and perform a failover method to take control of mass storage devices 120 previously assigned
to the formerly responding file server 110. The failover file server 110 will find those mass
storage devices 120 with their file system 121 in an internally consistent state, but with the few
most recent file server requests 151 as yet unperformed. The failover file server 110 will have
copies of these most recent file server requests 151 in its server request memory 160, and will
perform these file server requests 151 in response to those copies.

File Server Node

Figure 2 shows a block diagram of a file server in the file server system.

Each file server 110 includes at least one processor 111, a program and data
memory 112, the server request memory 160 (including a nonvolatile RAM), a network interface
element 114, and a disk interface element 115. These elements are interconnected using a bus
117 or other known system architecture for communication among processors, memory, and
peripherals.

In a preferred embodiment, the network interface element 114 includes a known
network interface for operating with the network 150. For example, the network interface
element 114 can include an interface for operating with the FDDI interface standard or the

After failover, the file server 110 responds to file server requests directed to either
itself or its (failed) partner file server 110. Each file server 110 is therefore capable of assuming
an additional network identity on failover, one for itself and one for its failed partner file server
110. In a preferred embodiment, the network interface element 114 for each file server 110
includes a network adapter capable of responding to two separate addresses upon instruction by
the file server 110. In an alternative embodiment, each file server 110 may have two such
network adapters.
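
The assumption of a second network identity on failover can be sketched as follows. This is an illustrative model only: the class `NetworkAdapter`, its methods, and the example addresses are hypothetical and are not taken from the disclosure.

```python
class NetworkAdapter:
    """Hypothetical adapter that can respond to a second address on instruction."""

    def __init__(self, own_address: str) -> None:
        self.addresses = {own_address}

    def assume_identity(self, partner_address: str) -> None:
        # Called by the surviving file server on failover.
        self.addresses.add(partner_address)

    def release_identity(self, partner_address: str) -> None:
        # Called when the partner has recovered and control is returned.
        self.addresses.discard(partner_address)

    def accepts(self, destination: str) -> bool:
        # A request is served if it is addressed to any identity currently held.
        return destination in self.addresses


adapter = NetworkAdapter("10.0.0.1")   # example address, assumed for illustration
before = adapter.accepts("10.0.0.2")   # partner's address is not yet served
adapter.assume_identity("10.0.0.2")    # failover: take on the partner's identity
after = adapter.accepts("10.0.0.2")
```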

In a preferred embodiment, the disk interface element 115 includes a known disk
interface for operating with magnetic, optical, or magneto-optical disks, that has two
independent ports with each port coupled to a separate file server 110, such as the FC-AL interface. This
helps prevent failure of one file server 110 from affecting low-level operation of the other file
server 110.

In a preferred embodiment, the bus 117 includes at least a memory bus 171 and
the mezzanine bus 130. The memory bus 171 couples the processor 111 and the program and
data memory 112. The mezzanine bus 130 couples the network interface element 114 and the
disk interface element 115. The memory bus 171 is coupled to the mezzanine bus 130 using an
I/O controller 173 or other known bus adapter technique.

In a preferred embodiment, each disk in the mass storage 120 is statically
assigned to either the first file server 110 or the second file server 110, responsive to whether the
disk is wired for primary control by either the first file server 110 or the second file server 110.
Each disk has two control ports A and B; the file server 110 wired to port A has primary control
of that disk, while the file server 110 wired to port B only has control of that disk when the file
server 110 wired to port A has failed.
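
The static port-based assignment can be sketched as a small lookup. The function name `controlling_server` and the representation of failed servers as a set of names are assumptions made for illustration.

```python
from typing import Optional, Set


def controlling_server(port_a: str, port_b: str, failed: Set[str]) -> Optional[str]:
    """Return the name of the file server that controls a disk.

    port_a is the server wired to port A (primary control); port_b is the
    server wired to port B, which controls the disk only after the port A
    server has failed. Returns None if both servers are down.
    """
    if port_a not in failed:
        return port_a
    if port_b not in failed:
        return port_b
    return None


normal = controlling_server("first", "second", failed=set())
after_failover = controlling_server("first", "second", failed={"first"})
```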

Operation Process Flow

Figure 3 shows a process flow diagram of operation of the file server system.

A method 300 is performed by the components of the file server system 100, and
includes a set of flow points and process steps as described herein.

At a flow point 310, a device coupled to the network 150 desires to make a file
system request 151.

At a step 311, the device transmits the file system request 151 to the network 150.

At a step 312, the network 150 transmits the file server request 151 to the file
server 110.

At a step 313, a first file server 110 at the file server system 100 receives the file
server request 151. The first file server 110 copies the file server request 151 into the first server
request memory 160, and also copies the file server request 151 into the second server request
memory 160 using the common interconnect. The target of the copying operation in the second
server request memory 160 is to an area reserved for this purpose. The copying operation
requires no further processing by the second file server 110, and the second file server 110 does
not normally process or respond to the file server request 151.

At a step 314, the first file server 110 responds to the file server request 151.

At a flow point 320, the file server request has been successfully processed.

In a second aspect of the invention, the first file server 110 provides mirroring of
one or more of its mass storage devices 120.

As with the first aspect of the invention, each incoming file server request is
copied to both the first file server 110 and the second file server 110. The first file server 110
performs the file server requests to modify one or more primary mass storage devices 120 under
its control. The first file server 110 also performs the file server requests to modify a set of
mirror mass storage devices 120 under its control, but located distant from the primary mass storage
devices 120. Thus, the mirror mass storage devices 120 will be a substantial copy of the primary
mass storage devices 120.

The mirror set of mass storage devices 120 provides a resource in the event the
entire primary set of mass storage devices 120 is to be recovered, such as if a disaster befalls the
primary set of mass storage devices 120.
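
The mirroring aspect can be sketched as one server applying each request twice. This is a simplified illustration: dictionaries stand in for the primary and mirror sets of mass storage devices, and the `(path, data)` request shape and the function name `perform_mirrored` are hypothetical.

```python
def perform_mirrored(request, primary: dict, mirror: dict) -> None:
    """Apply one write-style request to the primary set, then to the mirror set.

    Assumption: a request is a (path, data) pair and each storage set is
    modeled as a dict mapping paths to contents.
    """
    path, data = request
    primary[path] = data   # modify the primary mass storage devices
    mirror[path] = data    # perform the same request against the mirror set


primary_set, mirror_set = {}, {}
for req in [("/vol0/a", b"one"), ("/vol0/b", b"two"), ("/vol0/a", b"three")]:
    perform_mirrored(req, primary_set, mirror_set)
```

Because every request is performed against both sets, the mirror set holds the same contents as the primary set after the loop completes.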

At a flow point 330, the first file server 110 in the file server system 100 fails.

At a step 331, the second file server 110 in the file server system 100 recognizes
the failure of the first file server 110.

In a preferred embodiment, the second file server 110 performs the step 331 in
the following manner:

Each file server 110 maintains two disks of its mass storage devices 120 (thus, there are a
total of four such disks for two file servers 110) for recording state information about the
file server 110. There are two such disks (called "mailbox disks") so that one can be
used as primary storage and one can be used as backup storage. If one of the two
mailbox disks fails, the file server 110 using that mailbox disk designates another disk as one
of its two mailbox disks.

Each file server 110 maintains at least one sector on each mailbox disk, on which the file
server 110 periodically writes state information. Each file server 110 also sends its state
information to the other file server 110 using the interconnect using remote memory
access. The state information written to the mailbox disks by each file server 110 changes
with each update.

Each file server 110 periodically reads the state information from at least one of the
mailbox disks for the other file server 110. Each file server 110 also receives state
information from the other file server 110 using the interconnect using remote memory
access.

Each file server 110 recognizes if the other file server 110 has failed by noting that there
has been no update to the state information on the mailbox disks for the other file server
110.
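
The mailbox-disk check can be sketched as a poll that watches for changes in the partner's state information. This is a minimal sketch under stated assumptions: the names `MailboxSector` and `PartnerMonitor`, the use of a sequence counter as the state information, and the threshold `max_missed_polls` are all hypothetical; the text does not say how long a file server waits before recognizing a failure.

```python
from dataclasses import dataclass


@dataclass
class MailboxSector:
    """Hypothetical state record; the sequence value changes with each update."""
    sequence: int = 0


class PartnerMonitor:
    def __init__(self, max_missed_polls: int = 3) -> None:
        # Assumed threshold: consecutive polls with no change before failure is declared.
        self.max_missed_polls = max_missed_polls
        self.last_seen = None
        self.missed = 0

    def poll(self, sector: MailboxSector) -> bool:
        """Read the partner's sector; return True if the partner is presumed failed."""
        if sector.sequence != self.last_seen:
            self.last_seen = sector.sequence   # fresh update observed
            self.missed = 0
        else:
            self.missed += 1                   # no update since the last read
        return self.missed >= self.max_missed_polls


sector = MailboxSector()
monitor = PartnerMonitor(max_missed_polls=2)
results = [monitor.poll(sector)]     # first read: update observed
sector.sequence += 1                 # partner writes new state information
results.append(monitor.poll(sector))
results.append(monitor.poll(sector)) # partner silent: one missed poll
results.append(monitor.poll(sector)) # partner silent: two missed polls
```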



In a preferred embodiment, the second file server 110 determines whether failure
of the first file server 110 is a hardware error or a software error, and only recognizes failure of
the first file server 110 for hardware errors. In alternative embodiments, the second file server
110 may recognize failure of the first file server 110 for software errors as well.

At a step 332, the second file server 110 seizes control of all mass storage devices
120 previously assigned to the first file server 110. Due to the nature of the techniques shown in
the WAFL Disclosures, the file system 121 defined over those mass storage devices 120 will be
in an internally consistent state. All those file server requests 151 marked completed will have
been processed and the results incorporated into storage blocks of the mass storage devices 120.

In normal operation, neither file server 110 places reservations on any of the mass
storage devices 120. In the step 332 (only on failover), the second file server 110 seizes control
of the mass storage devices 120 previously controlled by the first file server 110, and retains
control of those mass storage devices 120 until it is satisfied that the first file server 110 has
recovered.

When the first file server 110 recovers, it sends a recovery message to the second
file server 110. In a preferred embodiment, the second file server 110 relinquishes control of the
seized mass storage devices 120 by operator command. However, in alternative embodiments,
the second file server 110 may recognize the recovery message from the first file server 110 and
relinquish control of the seized mass storage devices 120 in response thereto.

At a step 333, the second file server 110 notes all file server requests 151 in the
area of its server request memory 160 that were copied there by the first file server 110. Those
file server requests 151 whose results were already incorporated into storage blocks of the
storage devices 120 are discarded.

At a step 334, when the second file server 110 reaches its copy of each file server
request 151, the second file server 110 processes the file server request 151 normally.
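
Steps 333 and 334 can be sketched as a filter followed by a replay. This is an illustrative sketch only: the function name `failover_replay`, the use of numeric request identifiers, and the `completed_ids` set are assumptions; the text states that completed requests are identifiable but does not specify the mechanism.

```python
from typing import Callable, Iterable, List, Set, Tuple


def failover_replay(
    partner_area: Iterable[Tuple[int, str]],
    completed_ids: Set[int],
    process: Callable[[str], None],
) -> List[int]:
    """Replay the failed partner's logged requests on the surviving server.

    partner_area holds (request_id, request) copies in arrival order.
    Requests whose results are already in the consistent on-disk state are
    discarded (step 333); the rest are processed normally (step 334).
    """
    replayed = []
    for request_id, request in partner_area:
        if request_id in completed_ids:
            continue                 # already incorporated: discard
        process(request)             # perform as an ordinary request
        replayed.append(request_id)
    return replayed


performed = []
replayed_ids = failover_replay(
    [(1, "write a"), (2, "write b"), (3, "write c")],
    completed_ids={1, 2},
    process=performed.append,
)
```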

At a flow point 340, failover from the first file server 110 to the second file server
110 has been successfully handled.

Alternative Embodiments

Although preferred embodiments are disclosed herein, many variations are
possible which remain within the concept, scope, and spirit of the invention, and these variations
would become clear to those skilled in the art after perusal of this application.

Claims

1. A file server system including

a first file server including a file server change memory;

a second file server including a file server change memory;

a mass storage element;

said first file server and said second file server being coupled to said mass storage
element;

means for copying a descriptor of a file system change to both said first and
second file servers, whereby said first file server processes said file system change while said
second file server maintains its copy of said descriptor in its file server change memory; and

means for said second file server to perform a file system change in its file server
change memory in response to a service interruption by said first file server.

2. A system as in claim 1, including at least one said mass storage element
for each said file server.

3. A system as in claim 1, wherein a first said file server is disposed for
processing said file system changes atomically, whereby a second said file server can on failover
process exactly those file system changes not already processed by said first file server.

4. A system as in claim 1, wherein a first said file server is disposed to
respond identically to service interruptions for itself and for a second said file server.

5. A system as in claim 1, wherein at least one said file server is disposed to
delay output to said mass storage element without delaying a response to file system changes.

6. A system as in claim 1, wherein at least one said file server responds to a
file system change before committing a result of said file system change to mass storage.

7. A system as in claim 1, wherein

each one of said file servers is coupled to at least a portion of said file server
change memory using local memory access; and

each one of said file servers is coupled to at least a portion of said file server
change memory using remote memory access.

8. A system as in claim 1, wherein said descriptor includes a file server
request.

9. A system as in claim 1, wherein said file server change memory includes
a disk block.

10. A system as in claim 1, wherein said file server change memory includes
a file server request.

11. A system as in claim 1, wherein said file server change memory is
disposed to delay output to said mass storage element without delaying a response to file server
requests.

12. A system as in claim 1, wherein

said mass storage element includes a file storage system;

each said file server is disposed for leaving said file storage system in an
internally consistent state after processing file system changes;

said internally consistent state is associated with a set of completed file system
changes;

said set of completed file system changes is identifiable by each said file server.

13. A system as in claim 1, wherein said mass storage element includes a file
storage system and each said file server is disposed for leaving said file storage system in an
internally consistent state after processing each said file system change.

14. A file server system as in claim 1, wherein

said mass storage element includes a primary mass storage element and a mirror
mass storage element; and

said first file server processes said file system change for both said primary mass
storage element and said mirror mass storage element.

15. A system as in claim 1, wherein said means for copying includes access to
at least one of said first and second file server change memories using a NUMA network.

16. A system as in claim 1, wherein said means for copying includes remote
memory access to at least one of said first and second file server change memories.

17. A system as in claim 1, wherein said means for said second file server to
perform a file server request in its file server change memory is also operative in response to a
service interruption by said second file server.



18. A file server system including

a first file server coupled to a first set of mass storage devices;

a second file server coupled to a second set of mass storage devices;

a server change memory;

said first file server disposed for receiving a file server request and in response
thereto copying a descriptor of a file system change into said server change memory; and

said first file server disposed for processing said file system change for both said
first set of mass storage devices and for at least one said mass storage device in said second set.

19. A system as in claim 18, wherein

said second file server is disposed for receiving a file server request and in
response thereto copying a descriptor of a file system change into said server change memory; and

said second file server is disposed for processing said file system change for both
said second set of mass storage devices and for at least one said mass storage device in said first
set.

23 
24 

25 disk block. 



29 
30 
31 



20. A system as in claim 18, wherein said server change memory includes a 



26 

27 21. A system as in claim 18, wherein said server change memory includes a 

28 file server request. 



22. A system as in claim 18, wherein said server change memory includes a first portion disposed at said first file server and a second portion disposed at said second file server.

23. A system as in claim 18, wherein
said server change memory includes a first portion disposed at said first file server and a second portion disposed at said second file server; and
said first file server is disposed for copying said descriptor into both said first portion and said second portion.

24. A system as in claim 18, wherein
said server change memory includes a first portion disposed at said first file server and a second portion disposed at said second file server; and
said first file server and said second file server are each disposed for copying said descriptor into both said first portion and said second portion.



25. A system as in claim 18, wherein said server change memory is disposed to delay output to said mass storage element without delaying a response to file server requests.

26. A file server system including
a plurality of file servers, said plurality of file servers coupled to a mass storage element and at least one file server change memory;
each said file server disposed for receiving a file server request and in response thereto copying a descriptor of a file system change into said file server change memory; and
each said file server disposed for responding to a service interruption by performing a file system change in said file server change memory.

27. A system as in claim 26, including at least one said mass storage element for each said file server.

28. A system as in claim 26, including at least one said server change memory for each said file server.

29. A system as in claim 26, wherein a first said file server is disposed for processing said file system changes atomically, whereby a second said file server can on failover process exactly those file system changes not already processed by said first file server.
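
One way to picture the "exactly those file system changes not already processed" property of claim 29 is a log replay keyed on a sequence marker. This is a minimal sketch under stated assumptions: the sequence numbers, the `last_applied` marker, and the names `SharedVolume` and `replay_on_failover` are hypothetical and are not taken from the claims.

```python
class SharedVolume:
    """Hypothetical mass storage element that records the last applied change."""

    def __init__(self):
        self.files = {}
        self.last_applied = 0  # sequence number of the last completed change

    def apply_atomically(self, seq, path, contents):
        # Model atomicity: the data and the sequence marker are swapped in
        # together, so a change is either fully applied or not applied at all.
        new_files = dict(self.files)
        new_files[path] = contents
        self.files, self.last_applied = new_files, seq


def replay_on_failover(volume, change_log):
    """Apply only the logged changes the failed server had not completed."""
    replayed = []
    for seq, path, contents in sorted(change_log):
        if seq > volume.last_applied:
            volume.apply_atomically(seq, path, contents)
            replayed.append(seq)
    return replayed


volume = SharedVolume()
log = [(1, "/a", "one"), (2, "/b", "two"), (3, "/a", "three")]

# The first server completes changes 1 and 2, then fails.
for entry in log[:2]:
    volume.apply_atomically(*entry)

# The second server replays its copy of the log.
replayed = replay_on_failover(volume, log)
```

Because each change is all-or-nothing, the marker tells the surviving server precisely where to resume; replaying a second time applies nothing further.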

30. A system as in claim 26, wherein a first said file server is disposed to respond identically to service interruptions for itself and for a second said file server.

31. A system as in claim 26, wherein at least one said file server delays output to said mass storage element without delaying a response to file server requests.

32. A system as in claim 26, wherein at least one said file server responds to a file system change before committing a result of said file system change to mass storage.
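
The behavior in claims 31 and 32 (responding before the result is committed to mass storage) resembles a write-behind design. The sketch below is illustrative only; `WriteBehindServer`, `request`, and `flush` are hypothetical names, and an in-memory list stands in for the file server change memory.

```python
class WriteBehindServer:
    """Hypothetical server that acknowledges a change once it is logged."""

    def __init__(self):
        self.change_memory = []  # descriptors not yet committed to disk
        self.disk = {}           # stand-in for the mass storage element

    def request(self, path, contents):
        # Log the descriptor, then respond immediately; disk output is deferred.
        self.change_memory.append((path, contents))
        return "ok"

    def flush(self):
        # Later, commit the logged changes to mass storage in arrival order.
        while self.change_memory:
            path, contents = self.change_memory.pop(0)
            self.disk[path] = contents


server = WriteBehindServer()
reply = server.request("/report.txt", "draft")
pending_before_flush = len(server.change_memory)
server.flush()
```

The response is safe to send early only because the descriptor is already held in the change memory, from which the change can still be performed after a service interruption.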

33. A system as in claim 26, wherein
each one of said file servers is coupled to at least a portion of said file server change memory using local memory access; and
each one of said file servers is coupled to at least a portion of said file server change memory using remote memory access.

34. A system as in claim 26, wherein each said file server is disposed for copying said descriptors using a NUMA network.

35. A system as in claim 26, wherein each said file server is disposed for copying said descriptors using remote memory access.

36. A system as in claim 26, wherein said file server change memory includes a disk block.

37. A system as in claim 26, wherein said file server change memory includes a file server request.

38. A system as in claim 26, wherein said file server change memory is disposed to delay output to said mass storage element without delaying a response to file server requests.

39. A system as in claim 26, wherein said mass storage element includes a file storage system and each said file server is disposed for leaving said file storage system in an internally consistent state after processing each said file system change.

40. A system as in claim 26, wherein
said mass storage element includes a file storage system;
each said file server is disposed for leaving said file storage system in an internally consistent state after processing file system changes;
said internally consistent state is associated with a set of completed file system changes;
said set of completed file system changes is identifiable by each said file server.

41. A file server system as in claim 26, wherein
said mass storage element includes a primary mass storage element and a mirror mass storage element; and
said first file server processes said file system change for both said primary mass storage element and said mirror mass storage element.

42. A method of operating a file server system, said method including steps for
responding to an incoming file server request by copying a descriptor of a file system change to both a first file server and a second file server;
processing said file system change at said first file server while maintaining said descriptor copy at said second file server; and
performing, at said second file server, a file system change in response to a copied descriptor and a service interruption by said first file server.
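
The three steps of claim 42 can be sketched as a small simulation. Everything named here (`Server`, `submit`, `fail_over`, the `alive` flag, and a dict standing in for shared mass storage) is a hypothetical illustration, and the sketch assumes for simplicity that a descriptor is dropped from both copies as soon as the first server has processed it.

```python
class Server:
    """Hypothetical file server holding its own copy of change descriptors."""

    def __init__(self, name, storage):
        self.name = name
        self.storage = storage   # shared mass storage (a dict in this sketch)
        self.descriptors = []    # this server's change memory
        self.alive = True

    def process(self, descriptor):
        path, contents = descriptor
        self.storage[path] = contents


def submit(request, first, second):
    # Step 1: copy the descriptor to both servers.
    first.descriptors.append(request)
    second.descriptors.append(request)
    # Step 2: the first server processes the change while the second
    # only maintains its copy.
    if first.alive:
        first.process(request)
        # Assumption: a processed change can be dropped from both copies.
        first.descriptors.remove(request)
        second.descriptors.remove(request)


def fail_over(second):
    # Step 3: on a service interruption by the first server, the second
    # server performs the changes still held in its copy.
    while second.descriptors:
        second.process(second.descriptors.pop(0))


storage = {}
fs1, fs2 = Server("fs1", storage), Server("fs2", storage)
submit(("/a", "1"), fs1, fs2)   # processed normally by fs1
fs1.alive = False               # service interruption
submit(("/b", "2"), fs1, fs2)   # only logged
fail_over(fs2)                  # fs2 performs the logged change
```

Both requests end up applied to the shared storage even though the first server stopped before handling the second one.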

43. A method as in claim 42, including steps for associating a first file server and a second file server with a mass storage element.

44. A method as in claim 42, including steps for delaying output by at least one said file server to said mass storage system without delaying a response to file system changes.

45. A method as in claim 42, wherein a first said file server is disposed for processing said file system changes atomically, whereby a second said file server can on failover process exactly those file system changes not already processed by said first file server.

46. A method as in claim 42, wherein a first said file server is disposed to respond identically to service interruptions for itself and for a second said file server.

47. A method as in claim 42, wherein at least one said file server responds to a file system change before committing a result of said file system change to mass storage.

48. A method as in claim 42, wherein
each said file server includes a file server change memory;
each one of said file servers is coupled to at least a portion of said file server change memory using local memory access; and
each one of said file servers is coupled to at least a portion of said file server request memory using remote memory access.

49. A method as in claim 42, wherein said file server change memory includes a disk block.

50. A method as in claim 42, wherein said file server change memory includes a file server request.

51. A method as in claim 42, wherein said file server change memory is disposed to delay output to said mass storage element without delaying a response to file server requests.

52. A method as in claim 42, wherein said mass storage element includes a file storage system and each said file server is disposed for leaving said file storage system in an internally consistent state after processing each said file system change.

53. A method as in claim 42, wherein said steps for performing a file system change in response to a copied descriptor are also operative in response to a service interruption by said second file server.

54. A method as in claim 42, wherein said steps for processing include steps for processing said file system change at both a primary mass storage element and a mirror mass storage element.

55. A method of operating a file server system, said method including steps for
receiving a file server request at one of a plurality of file servers and in response thereto copying a descriptor of a file system change into a server change memory;
processing said file system change for both a first set of mass storage devices coupled to a first one said file server and for at least one said mass storage device in a second set of mass storage devices coupled to a second one said file server.

56. A method as in claim 56, wherein said descriptor includes a file server request.

57. A method as in claim 56, wherein said server change memory includes a disk block.

58. A method as in claim 56, wherein said server change memory includes a file server request.

59. A method as in claim 56, wherein said server change memory includes a first portion disposed at said first file server and a second portion disposed at said second file server.

60. A method as in claim 56, wherein said server change memory includes a first portion disposed at said first file server and a second portion disposed at said second file server; and wherein said steps for copying include steps for copying said descriptor into both said first portion and said second portion.

61. A method as in claim 56, wherein said server change memory includes a first portion disposed at said first file server and a second portion disposed at said second file server; and said steps for copying include steps for copying said descriptor into both said first portion and said second portion by either of said first file server or said second file server.

62. A method as in claim 56, wherein said server change memory is disposed to delay output to said mass storage element without delaying a response to file server requests.

63. A method as in claim 56, wherein
said steps for receiving include receiving a file server request at either said first file server or said second file server, and said steps for copying said descriptor include copying by either said first file server or said second file server; and including steps for
processing said file system change for both said second set of mass storage devices and for at least one said mass storage device in said first set.

64. A method of operating a file server system, said method including steps for
receiving a file server request at one of a plurality of file servers and in response thereto copying a descriptor of a file system change into a file server change memory; and
responding to a service interruption by performing a file system change in response to a descriptor in said file server change memory.

65. A method as in claim 65, including steps for associating a plurality of file servers with at least one mass storage element and at least one file server change memory.

66. A method as in claim 65, including steps for delaying output to a mass storage element without delaying a response to file server requests.

67. A method as in claim 65, including steps for leaving a file storage system on said mass storage element in an internally consistent state after processing each said file system change.

68. A method as in claim 65, including steps for
leaving a file storage system on said mass storage element in an internally consistent state after processing file system changes;
associating said internally consistent state with a set of completed file system changes; and
identifying said set of completed file system changes by at least one said file server.

69. A method as in claim 65, including steps for performing said received file server request at both a primary mass storage element and a mirror mass storage element.

70. A method as in claim 65, including steps for
processing said file system changes atomically at a first said file server; and
on failover processing exactly those file system changes not already processed by said first file server.

71. A method as in claim 65, including steps for responding identically at a first said file server to service interruptions for itself and for a second said file server.

72. A method as in claim 65, wherein said file server change memory includes a disk block.

73. A method as in claim 65, wherein said file server change memory includes a file server request.

74. A method as in claim 65, wherein said file server change memory is disposed to delay output to said mass storage element without delaying a response to file server requests.

75. A method as in claim 65, including steps for responding to a file system change before committing a result of said file system change to mass storage at one said file server.